How representative are student convenience samples? A study of literacy and numeracy skills in 32 countries

Psychological research, including research into adult reading, is frequently based on convenience samples of undergraduate students. This practice raises concerns about the external validity of many accepted findings. The present study seeks to determine how strong this student sampling bias is in literacy and numeracy research. We use the nationally representative cross-national data from the Programme for the International Assessment of Adult Competencies to quantify skill differences between (i) students and the general population aged 16–65, and (ii) students and age-matched non-students aged 16–25. For literacy scores, the median effect size across 32 countries was d = .56 for comparison (i) and d = .55 for comparison (ii); both exceed the average effect size in psychological experiments (d = .40). Numeracy comparisons (i) and (ii) showed similarly strong differences. The observed differences indicate that undergraduate students are representative of neither the general population nor age-matched non-students.


Introduction
Over the past two decades, growing concerns have been raised about psychological research's overreliance on convenience samples of undergraduate students. Arnett (2008) [1] found that up to 80% of samples in APA-published studies consisted of undergraduate psychology students. A decade later, Rad et al. (2018) [2] reported that although the trend was decreasing, many studies continued to rely on students. Relying heavily on student samples is an extension of the well-known bias of drawing samples from Western, Educated, Industrialized, Rich, and Democratic (WEIRD) societies [3]. Not only are student samples frequently drawn from WEIRD countries [1,2], but they are even WEIRDer within their countries, given that students tend to come from higher socio-economic backgrounds, tend to be between 18 and 24 years of age, and are by nature highly educated. As such, the undergraduate sampling bias compromises one of the core goals of psychological research: external validity.
We are not the first to raise concerns about external validity and the undergraduate sampling bias (see above as well as [4]). Rather, we seek to strengthen the literature by quantifying just how well students represent the general population of their countries. Several previous studies have found students to be unrepresentative in the field of cognitive psychology. Snowberg and Yariv (2021) [5] found that American undergraduates exhibited greater cognitive skill and strategic sophistication than a representative sample of the United States. Similarly, Brañas-Garza et al.'s (2019) [6] cognitive meta-study found that students score significantly higher than non-students on the Cognitive Reflection Test (CRT), a measure used to assess decision-making processes. Performance on the CRT is also highly correlated with other cognitive measures such as the Wonderlic Personnel Test (WPT), which measures general cognitive ability, and standardized college admissions tests such as the American College Testing (ACT) and Scholastic Aptitude Test (SAT), which measure academic achievement [7]. These findings suggest that relying on undergraduate samples will be equally challenging to the generalizability of educational outcomes such as literacy and numeracy, the focus of this study.
As mentioned above, undergraduate samples do not challenge the generalizability of literacy and numeracy research simply because they are highly educated, but also because they represent a narrow age range. A number of studies indicate age is a significant predictor of cognitive skills including literacy and numeracy. For instance, Kirasic et al. (1996) [8] showed that middle aged and older adults performed worse than young adults on information processing, working memory, and declarative learning tasks, many of which tap into the component skills of literacy and numeracy. Older adults likewise perform worse on direct measures of numeracy skills than younger adults [9][10][11]. Similarly, Green and Riddell (1998) [12] and Kyröläinen and Kuperman (2021) [13] also report a negative correlation between age and performance on literacy assessments in adults aged . Therefore, samples of undergraduate students, who tend to be young adults in peak cognitive conditioning, are unlikely to be representative of the cognitive behaviours of the general population.
The current study seeks to quantify just how accurately undergraduate students represent the general population in terms of two complex cognitive skills, namely literacy and numeracy (defined below). There are at least three reasons to single out literacy and numeracy from other cognitive and social phenomena. First, research on these topics is biased towards studying student populations. University students are overrepresented as a source of empirical data in reading research, particularly when it comes to lexical mega-studies and eye-movement corpora. The English Lexicon Project, British Lexicon Project, and Dutch Lexicon Project are large scale collections of lexical decision and naming times for thousands of words in their respective languages and have been used to develop several theories of word processing [14][15][16]. Similarly, the Ghent Eye-tracking Corpus (GECO) and Multilingual Eye-tracking Corpus (MECO), which recorded eye-movement data while participants read longer texts, have been used to inform theories of reading behaviour and eye-movement control [17,18]. Each of these valuable and well cited datasets collected their data primarily or exclusively from university students. How well students represent the general population in terms of complex skills such as literacy and numeracy may be an indicator of how representative these samples are in terms of component skills such as reading behaviour, numeric reasoning, working memory, and cognitive control.
The second reason for singling out literacy and numeracy is for their societal importance. In this technological era, these advanced cognitive skills are critical for individual employability, life satisfaction, health, and for the societal and economic prosperity of nations [19,20]. Finally, literacy and numeracy are skills that students are actively trained on and selected for (e.g. [21]), whereas, on a daily basis, non-students employ these skills to more varied degrees. Over the course of their secondary schooling, individuals typically need to succeed in a series of examinations that precisely target literacy and numeracy in order to be admitted to postsecondary education. Simultaneously, an individual's perception of their literacy and numeracy levels informs their decision on whether to pursue post-secondary education [21]. This selectivity favors more literate and numerate individuals to become undergraduate students in the first place. In addition, post-secondary education further boosts students' literacy and numeracy by providing intense practice and high stakes for meeting institutional demands on these skills [22,23]. Against this background, the question is hardly whether university students differ from the broader population of language speakers. Instead, we ask just how different are they?
The present study answers this question by reporting an analysis of literacy and numeracy skills based on comparative data from 24 languages and 32 countries across 5 continents. To our knowledge, this is the first large-scale analysis that quantifies the degree to which undergraduate students represent the general population regarding literacy and numeracy. Given that undergraduate students are the population most frequently sampled in psycholinguistics, we seek to determine how different students are from (i) the general population of adults and (ii) from the age-matched non-student population, in terms of literacy and numeracy skills within and across countries.
We use the Programme for the International Assessment of Adult Competencies (PIAAC) [24], which is an international survey assessing literacy, numeracy and problem-solving skills in the adult population. PIAAC defines literacy as "understanding, evaluating, using and engaging with written texts to participate in society, to achieve one's goals, and to develop one's knowledge and potential" [25]. The PIAAC definition of numeracy is "the ability to access, use, interpret and communicate mathematical information and ideas, in order to engage in and manage the mathematical demands of a range of situations in adult life". The assessment measures reading for a purpose (i.e., to gather knowledge, evaluate the text, form an opinion, etc.) [26]; see Methods for an example. This draws on information processing and working memory skills in addition to basic reading skills such as phonological decoding and vocabulary knowledge. Literacy and numeracy tasks in PIAAC clearly require combining and coordinating multiple cognitive processes and component skills. PIAAC only provides the scores for the most inclusive and complex literacy and numeracy tasks rather than the individual component skills (except for a small subset of mainly low-literacy participants [27]). Yet group differences in the participants' performance in these complex tasks enable speculation and hypothesis-building with respect to the expected differences in at least some of the required component skills.
One beneficial feature of the PIAAC data is that each participating country was required to produce a probability-based sample (with a minimum size N = 5000) representative of the population of adults aged 16 to 65 in the country. Another advantage of the PIAAC data is that the literacy and numeracy scores are psychometrically validated and directly comparable across countries and languages of administration. The result is rich data from 24 languages (including Arabic, Hebrew, Japanese, Kazakh, and Korean) adding valuable insights beyond the over-researched realm of alphabetic Indo-European languages.

Programme for the international assessment of adult competencies
We use the publicly available PIAAC data to estimate effect sizes for comparisons of literacy and numeracy skills between (i) university students and their respective country's adult population (16-65 years old), and (ii) students and non-students in the same age cohort. The more specific comparison (ii) pits undergraduate students against their own age group (16-25 years old) and thus estimates the critical difference while largely subtracting the effect of aging and the cohort effect, which are known to be pivotal in the distribution of cognitive skills in society [13,28,29].
We focus on two cognitive skills: literacy and numeracy. Both skills are assessed in PIAAC through tests that simulate the demands of work, social and everyday life on multiple skill facets [30,31]. For instance, participants may read a list of preschool rules and be asked for the latest time at which children should arrive. In the case of literacy, the test items engage all levels of reading comprehension (including decoding, knowledge of vocabulary, ability to process information at the word, sentence and discourse level, reading fluency and inferential skills) as well as the ability to read digital texts (using hyperlinks and navigation). For sample items see http://www.oecd.org/skills/piaac/Literacy%20Sample%20Items.pdf.
The publicly available files with PIAAC data from 35 countries were retrieved from https://www.oecd.org/skills/piaac/data/. We used the files from the first cycle of data collection which took place from 2011-2012 (round 1), 2014-2015 (round 2), and 2017 (round 3). The [redacted] Research Ethics Board deems this use of secondary data exempt from ethics clearance requirements. Three national samples out of the total set of 35 participating countries were removed from the analysis either because they did not contain variables critical for our analyses (Denmark, Russian Federation) or had a sample of fewer than 1000 participants after the trimming described below (Singapore). The following data-processing and trimming steps were applied to the remaining 32 datasets. First, we only considered individuals who were born in the country of test administration and were native speakers of the language in which they took the test. This restriction enabled us to filter out effects of immigration and second language acquisition on the distribution of cognitive skills in a national sample (see [32]). Individual data with missing values for education and occupational status were removed as well.
The resulting national samples and respective weights (see below) were used for estimation of literacy and numeracy skills in different population segments of respective countries. One such segment, labeled Student, included individuals between 16 and 25 years of age who (a) were studying in a formal education setting or working and studying simultaneously, and (b) had completed either upper secondary education, a bachelor's degree or a master's degree at the time of data collection. Another sample, labeled Young, incorporated all individuals in the 16-25 age range who were not part of the Student sample. The final and most inclusive sample, labeled General, consisted of all participants from the trimmed sample of a given country except those in the Student sample. That is, the General sample included the Young sample, but not those in the Student sample. Naturally, many of the participants in the General sample are also former students, which may attenuate the differences between the General and Student samples. Since neither the Young nor the General samples overlapped with the Student sample, we conducted pairwise comparisons between independent samples. Sizes of all samples are reported in Table 1 for each country.
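The sample definitions above amount to a simple partition of each trimmed national sample. The following Python sketch is illustrative only: the record fields (`age`, `is_studying`, `education`) and the education labels are hypothetical names, not PIAAC variable names, and the function assumes that the trimming steps described earlier have already been applied.

```python
from dataclasses import dataclass

# Hypothetical education labels; the actual PIAAC coding differs.
QUALIFYING_EDUCATION = {"upper_secondary", "bachelor", "master"}


@dataclass
class Respondent:
    age: int
    is_studying: bool  # in formal education, possibly while also working
    education: str     # highest completed level (hypothetical label)


def is_student(r: Respondent) -> bool:
    """Criteria (a) and (b) for the Student sample, restricted to ages 16-25."""
    return 16 <= r.age <= 25 and r.is_studying and r.education in QUALIFYING_EDUCATION


def partition(respondents):
    """Split a trimmed national sample into Student, Young and General."""
    student = [r for r in respondents if is_student(r)]
    # General: everyone in the trimmed sample except the Student sample.
    general = [r for r in respondents if not is_student(r)]
    # Young: the 16-25 subset of General, so it never overlaps with Student.
    young = [r for r in general if 16 <= r.age <= 25]
    return {"Student": student, "Young": young, "General": general}
```

Under these assumptions, Young is by construction a subset of General, and neither overlaps with Student, which is what permits comparisons between independent samples.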

Statistical considerations
Large-scale international assessments such as PIAAC aim to test a broad range of test constructs while minimizing the response burden on the individual. As such, each participant in PIAAC only responded to a subset of test items and a set of plausible values was derived to estimate the individual's overall proficiency, including on the items they did not respond to [33]. The matrix sampling method of PIAAC means that the sets of items that each participant encounters and responds to are not identical. To enable an accurate estimation of the measurement error, an individual score in each cognitive skill test is represented as 10 plausible estimates of what that person's performance would be. Each plausible value is defined on the test scale from 0 to 500 points. When estimating a participant's performance in, say, a literacy or numeracy task, plausible values are sampled through a bootstrapping procedure to produce both a point-wise estimate and an estimate of variability incurred by the non-identical test items that each participant encounters.
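To illustrate how several plausible values yield one estimate and one measure of uncertainty, the Python sketch below applies the combination rule commonly used with plausible values (Rubin's rules for multiple imputations). It is not the bootstrapping procedure itself, and it is not taken from the PIAAC software. The function and argument names are hypothetical, and the sampling variances are assumed to have been computed separately for each plausible value.

```python
from statistics import mean, variance


def combine_plausible_values(pv_estimates, sampling_variances):
    """Combine per-plausible-value estimates of one statistic (e.g. a mean).

    pv_estimates: the statistic computed once per plausible value.
    sampling_variances: the sampling variance of that statistic, also
        computed once per plausible value (assumed to be given).
    Returns the point estimate and its standard error.
    """
    m = len(pv_estimates)
    point = mean(pv_estimates)                # average over plausible values
    sampling_var = mean(sampling_variances)   # average sampling variance
    imputation_var = variance(pv_estimates)   # between-PV variance, divisor m - 1
    total_var = sampling_var + (1 + 1 / m) * imputation_var
    return point, total_var ** 0.5
```

The second variance term grows when the plausible values disagree with one another, which is how uncertainty from the non-identical item sets enters the standard error.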
PLOS ONE

Moreover, each participant in the PIAAC survey is associated with a weight, allowing the tested person to stand for a larger segment of the population. The weights are based on census data and determined by the combination of the participant's age, gender, education, place of residence and additional factors (for details see [34]). Specifically, the PIAAC data use Jackknife Repeated Replication weights that correct for the complex designs of the samples, which vary from country to country [34]. Computational procedures have been developed which process the individual plausible values and apply the appropriate weighting to derive estimates of means and variances that are representative of a given participant sample in the given country (for more detail see [33]). In this analysis, nationally representative estimates of literacy and numeracy have been obtained for the General, Young and Student samples using the package intsvy, which is provided in the statistical platform R 3.6.1 [35] and is specifically designed for the PIAAC data [36]. To quantify differences in literacy and numeracy scores between samples, we used the classic Cohen's d metric for independent samples, where the difference of means between samples is divided by the pooled standard deviation accounting for unequal sample sizes [37,38]. Estimates of Cohen's d as an effect size metric are based on estimates of means and standard deviations corrected through weighting to be nationally representative.
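The effect size computation described above can be sketched in a few lines of Python. This is a minimal illustration, not the analysis code: the helper names are hypothetical, the weighted standard deviation uses a simple population-style formula chosen here for brevity, and the replicate-weight variance estimation handled by the R package is omitted.

```python
from math import sqrt


def weighted_mean(values, weights):
    """Mean of the values, each counted in proportion to its survey weight."""
    total = sum(weights)
    return sum(w * x for x, w in zip(values, weights)) / total


def weighted_sd(values, weights):
    """Weighted standard deviation (population-style; an assumption of this sketch)."""
    mu = weighted_mean(values, weights)
    total = sum(weights)
    return sqrt(sum(w * (x - mu) ** 2 for x, w in zip(values, weights)) / total)


def cohens_d(mean1, sd1, n1, mean2, sd2, n2):
    """Cohen's d for independent samples with a pooled standard deviation.

    The pooled variance weights each sample's variance by its degrees of
    freedom, which accounts for unequal sample sizes.
    """
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
    return (mean1 - mean2) / sqrt(pooled_var)
```

As a purely illustrative case (the numbers are not taken from the tables), two samples with means of 300 and 280 and a common standard deviation of 40 give d = 0.5 regardless of their sizes, because the pooled standard deviation is then also 40.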

In the aggregated data and in specific countries, the distribution of skills in each sample (General, Student, Young) is symmetrical and the Student sample is shifted to the right relative to the Young and General samples. Tables 1 and 2 report descriptive statistics and sample sizes for General, Young, and Student samples in each country for literacy and numeracy respectively. Additionally, the tables report effect sizes (Cohen's d) of comparisons between the Student and Young populations, as well as the Student and General populations (see below).
In all countries, the mean literacy and numeracy scores of the Student samples were superior to those found among both young adults and among the general populations. On the PIAAC test scale, the mean difference between Student and General samples was 24 points for literacy and 22 points for numeracy. A comparable advantage of the Student sample over the Young sample was observed: 22 points for literacy and 25 points for numeracy. These differences are massive: that is, they are as large as or larger than the difference between the 25th and 75th percentile of literacy (20 points) and numeracy (23 points) for the General samples of all countries. The variance of scores in the Student sample was not statistically different from the variances in either the Young or the General sample, in terms of either literacy or numeracy (all F < 1.3, all p > 0.5 in F tests). This finding runs counter to the intuition that student samples are more homogeneous, due to the selectivity of educational institutions and self-selection, than the population at large. However, the finding converges with Hanel and Vione's (2016) [39] report of a similar variability in personality traits and attitudes among students and general populations of 59 countries.
To quantify the differences in a way that is comparable to the relevant psychological literature, we calculated Cohen's d metric for independent samples. Cohen's d for the comparison of literacy scores between students and the general population ranged from negligible (d = 0.07, Cyprus) to strong (d = 0.82, Chile), with the median of d = 0.56 and d = 0.40 and
The importance of the present findings comes to light when compared against meta-analytical estimates of effect sizes of studies published in the field of psychology. An influential meta-analysis and replication of 100 experimental and correlational papers in psychology [40] places the estimated average effect size of the original studies at d = 0.403 (SD = 0.188) and that of the replications at d = 0.197 (SD = 0.257). Another meta-analysis of 447 psychological papers [41] reports a negative correlation between sample size and effect size. While their estimate of the mean effect size d across all sample sizes is close to 0.4, the largest samples in their data  [41], a predicted effect size for such samples would hover around d = 0.2. Thus, the effect sizes that we observe in our data exceed the expected values by a factor of 2 to 2.5 for both literacy and numeracy when comparing students to both the general and age-matched non-student populations. Moreover, virtually all individual countries in our analyses showed effects stronger than those expected in the published literature in the field of psychology (for variability of effect sizes across types of studies and subdisciplines of psychology see e.g., [42]). In summary, the results show that drawing conclusions about language and math functioning among groups of adult speakers based on evidence from undergraduate students comes with a strong systematic bias in many countries of the world.

Discussion
The present paper advances the research agenda that examines sampling biases in psychological research, and more specifically literacy and numeracy studies. Convenience samples of undergraduate students are over-represented in the empirical evidence base and play a disproportionately large role in scientific theory-making [1,2]. Given the common practice of using data from university students to inform theories of linguistic and cognitive processing (as reviewed in the Introduction), reading studies are similarly likely to suffer from a student sampling bias. We quantified just how well undergraduate students represent (i) the general population (age 16-65) of their country and (ii) age-matched non-students (age 16-25) in terms of literacy and numeracy skills across 32 countries and numerous languages and cultures. Most importantly for the current study, the PIAAC data avoid bias within each selected country. That is, students and all other population segments are represented with the same probability as they naturally occur in that country [43]. In all countries in the dataset, students' mean literacy and numeracy scores were far superior to those of both the non-student young adults and the general populations. While the latter fact may not seem surprising, we find it noteworthy given that many participants in the General population were former post-secondary students and furthered their cognitive skills through additional years of practice. While effect sizes varied across countries, median effect sizes in all comparisons either met or exceeded those typically found in psychological literature (d > 0.4) [40][41][42][44].
These observations lead to several striking conclusions about the practice of studying language behavior and numeracy using convenience pools of university students. First, it is inaccurate to consider students as a group representative of the population at large. They are as different from the General population (excluding students) as the 25th percentile is different from the 75th percentile in that population. Second, it is even less accurate to treat students as a group representative of non-students of the same age. To put "inaccuracy" into perspective, imagine a high powered pre-registered psychological experiment with a treatment and control group. Imagine further that this group difference shows an effect stronger than those typically observed in experimental psychology (d > 0.4). Imagine, finally, that the experimenter interprets the behavior of the treatment group as a valid approximation of the behavior of the control group. In fact, they view the results as support for the null hypothesis and claim the treatment group is representative of the entire population. This scenario is a statistical equivalent of assuming the literacy and numeracy behavior of students represents that of the general or age-matched population of speakers of the same language.
One clear theoretical impact of this mismatch between students' reading skills and those of other populations is that it raises questions about the generalizability and external validity of empirical research based on the literacy- and numeracy-related behaviours of undergraduate students. To be clear, sampling from student populations is not in itself a problem, so long as the findings are interpreted within the student population. Yet such disclaimers are rarely found in psychological literature (including in our own work). Consequently, readers can make the logical assumption that findings based on undergraduate student groups generalize over the entire adult population. However, as the findings above indicate, students are rarely representative of the adult population when it comes to literacy and numeracy. Therefore, we hope to have demonstrated that caution is needed when studying phenomena that rely on highly trained cognitive skills such as literacy and numeracy.

Limitations and future directions
The estimates in this study are calculated on the basis of a single, though complex and comprehensive, task. As such, we can only say with certainty that students do not represent other populations in terms of the PIAAC measures of literacy and numeracy. It is up to future research to quantify how representative undergraduate students are on other measures of literacy and numeracy. For instance, we speculate that students will not be representative of other population groups in terms of their fluency in literacy and numeracy-related tasks. Specifically, we predict students to be faster than other populations both because they showed higher accuracy in the tasks reported here, and because of multiple reports of higher accuracy being associated with higher speed of task completion: for early reports and recent reviews in reading see [45,46].
Literacy and numeracy, particularly as assessed in PIAAC, require the coordination of multiple cognitive processes and mastery of multiple component skills. We predict that students will also be unrepresentative when it comes to the component skills of reading and numeracy such as working memory, numeric reasoning, and word processing. Since PIAAC does not test these component skills directly, the current study cannot indicate whether these group differences indeed exist or whether the effect sizes will be reduced or amplified on other tasks. Future investigations should continue to quantify differences between student and other populations both on comprehensive literacy and numeracy assessments, as well as tasks targeting their component skills. This paper provides a qualitative indication that such differences are likely to be found.
The main question explored in this study (how different are students' cognitive skills from those of other population groups) is coupled with at least two other questions that are out of the scope of the present paper: (a) what contributes to these differences, and (b) how do these differences influence the inquiry into psychological traits and processes in other domains. Question (a) has been extensively covered in studies of literacy and numeracy development as well as research on post-secondary education (for select reviews see [21,47-49]). We note however that the by-country breakdown of the differences between samples (reported in Tables 1 and 2) can further boost this research as these differences are likely to be co-determined by demographic and socio-economic characteristics of those countries and their investment in both the spread and quality of (post-secondary) education. The present data do not shed light on question (b), therefore we relegate further exploration of (a) and (b) to future research. We also note that the present study highlighted group differences in advanced behaviors tested in PIAAC data. These behaviors demand a proficient and coordinated use of multiple component skills, including word recognition, reading fluency and reading comprehension. How the over-reliance on sampling university students affects the accuracy of verbal and computational models of such component skills (partly discussed in the Introduction) is an important question for further examination.

Conclusion
To be sure, few researchers of literacy or numeracy are likely to endorse a premise that students accurately represent the literacy or numeracy skills of the general population. Yet it is important to realize that this premise is implicit in the common practice of reporting experimental findings or computational models based on university students without a disclaimer about their limited generalizability. We do not wish to imply that the field of language or numeracy research is ignorant of the problem. To give only a few examples to the contrary, there are ongoing efforts to study literacy in older adults [8,12,50], communities of low socioeconomic status [51], as well as populations with lower literacy or lower academic attainment [20,52-58]. Additionally, an increasing number of comparative literacy and numeracy studies draw community or representative samples for their hypothesis testing (see among many others [13,29,59-62]). Finally, as undergraduate sampling relates to the WEIRD bias, we also highlight the growing body of cross-linguistically comparable samples in literacy research (among others, [18,63-66]). Still, collecting normative population-wide data is an expensive, time-consuming process, and funding agencies can be more reluctant to provide support for such endeavors than for research of groups defined by their clinical, demographic, or social status. Change in the culture of research must be complemented by change in scientific policy-making. We echo the recommendations of Henrich et al. (2010) [3] and Rad et al. (2018) [2] for researchers to explicitly address questions of generalizability in their samples, make data freely available to aid comparative research efforts, collect data broadly within their countries, and build partnerships with community members and researchers, particularly in non-WEIRD countries.
Moreover, we urge funding agencies and policy-makers to recognise the importance of minimising the student sampling bias in language research and value projects with representative and non-WEIRD samples accordingly. The movement towards more inclusive data coverage and external support for such coverage is necessary to maintain high standards of psychological research.